Introducing Covariate Tagging: Outperforming Zero-Shots with Raw Encoders

Michelangiolo Mazzeschi • 2024-12-17

Raw-encoders cannot be used classification, this article shows how to make this possible

The zero-shots models

***To understand this article, knowledge of embeddings, clustering, and zero-shots systems is required. The implementation of this algorithm has been released on GitHub and is fully open-source. I am open to criticism and welcome any feedback.

In this article, I am going to introduce a novel approach to labeling called covariate tagging or covariate labeling. At the moment, the two most common approaches to labeling using machine learning are:

Zero-shots
LLM tagging

Despite LLMs being exceptional at labeling samples (an advantage given by their reasoning abilities), their computational demands are in the order of magnitude greater than the first option.

Zero-shots models, on the other hand, are much more lightweight (they can pretty much run on a single computer), and their performance is suboptimal.

In addition, neither options allow for a large number of choices. If, for example, we had to choose between a list of over 50 tags, we would need to start employing semantic approximation methods such as clustering to cap the number of possible tags. Even the best zero-shots models might need several minutes of processing to run a large number of tags.

In conclusion, the current technology offers no viable option to perform zero-shot labeling at scale.

What about raw encoders?

So far, we have not mentioned a third labeling method: raw encoders. This approach consists in using cosine similarity to determine which ones are the top labels among the list. This method is considered so bad that is not even attempted. Today, we are going to turn the tides and put it back among the top-tier list of labeling tools.

Why are raw encoders so bad?

There are two limitations that cripple this approach:

Cosine similarity is not a reliable classification metrics It is very common for cosine similarity to mistake a label for another, and give results that make absolutely no sense. This is the main reason why it has been abandoned from any classification task.
We do not know how to stop knn
There is no optimal number of neighbors (unless you employ advanced techniques like adaptive-knn, which has never been experimented on this approach, to my knowledge). No matter how many labels we throw to the model, they will just be given a score, and there is no proper way of selecting the top k tags (being optimal k unknown).

Covariate Tagging

This new approach is built on top of covariate encoding to assess different chunks of labels, and follows a sequence of steps that can solve both problems outlined above. The algorithm is organized into 6 steps, and at the end it should prove as a valid approach to label any encoded data, relying on a huge number of tags (even 1000).

Our goal is to propose a solution that can greatly outperform zero-shots, and despite not being a match with the reasoning abilities of LLMs, it can still outperform them in terms of input limit (LLMs can maybe process 50 tags, covariate tagging can handle thousands without breaking a sweat).

Preparation

To create our zero-shot model, we need proper samples and a number of tags used for classification. One dataset that I have been using in many tagging situations is the Steam game dataset, which contains over 40.000 game descriptions and only 446 tags (which is enough to demonstrate the capabilities of covariate tagging, but not excessive to require further processing).

Our goal is to improvise a classifier capable of converting any textual game description into a set of labels chosen from the list of 446 tags. Note that this approach uses a pre-trained raw encoder, and does not require any training on our end.

['Dragons',
 'Tactical RPG',
 'Nudity',
 'Villain Protagonist',
 'Colony Sim',
 'Tutorial',
 'Lemmings',
 'Movie',
 'Dungeon Crawler',
 'Sci-fi',
 'Tactical',
 'Asymmetric VR',
 ...
 ]

In our experiment, we will try to find the labels for the following game description:

'MazM: Jekyll and Hyde' is a darkly entertaining adventure game based on the classic 1886 novel 'The Strange Case of Dr.
Jekyll and Mr. Hyde' by Robert Louis Stevenson, in which you'll tackle the mystery from a totally new angle! You'll
travel back to 19th century London and view the city through the eyes of Mr. Utterson, a lawyer that walks the true path
hunting for clues to solve a disturbing mystery, and Mr. Hyde, who has been pushed to his physical limits. Wander the
streets of this psychological thriller and prepare for an ending you would never expect! The version of London presented
in 'MazM: Jekyll and Hyde' has a dark, heavy atmosphere, creating a sense of the eerie and macabre. The stunning artwork
and unsettling music help to further intensify the disturbing nature of the game. Travel throughout London searching for
clues, and allow the world of this classic novel to envelop you as you experience the tale of one man's many challenges
and potential downfall!

1. Sort tags by similarity

Our first step will be to perform a knn on all tags on the game description vector. The results will be the original tag list, but sorted by similarity: the first tag in our sorted list will be the most relevant to the query, while the last one the most irrelevant.

['Based On A Novel',
 'Villain Protagonist',
 'Mystery Dungeon',
 'RPG',
 'Adventure',
 'Action RPG',
 'Thriller',
 'Silent Protagonist',
 'Action-Adventure',
 'Strategy RPG',
 'Interactive Fiction',
 'Dungeons & Dragons',
 'Dark Fantasy',
 ...
 ]

2. Group tags with a scrolling window

Our second step will be to group them using a scrolling windows of an arbitrary number. This step allows us to somehow “cluster” the samples together. The assumption is that because the tags in our zero-shot list are not unique, there will be multiple labels that are semantically similar, and they will probably appear together.

Once the similarity sorting has been applied, we can notice how labels of the same group will be chunked together. Note that this process cannot be replaced by clustering, this sorting is relative to our input score, while clustering will only look at the relationships between labels.

[
	['Based On A Novel', 'Villain Protagonist', 'Mystery Dungeon', 'RPG', 'Adventure'],
	['Action RPG', 'Thriller', 'Silent Protagonist', 'Action-Adventure', 'Strategy RPG'],
	['Interactive Fiction', 'Dungeons & Dragons', 'Dark Fantasy', 'Visual Novel', 'Mahjong'],
	['JRPG', 'Dungeon Crawler', 'Choose Your Own Adventure', 'Episodic', 'Tactical RPG'],
	['Comic Book', 'Party-Based RPG', 'Immersive', 'Character Action Game', 'Psychological Horror'],
	['Exploration', 'RPGMaker', 'Escape Room', 'Dystopian', 'Detective'],
	['Dark Comedy', 'Immersive Sim', 'Survival Horror', 'MMORPG', 'Dynamic Narration'],
	...
]

3. Covariate encode each window

Any attempt to use a raw encoder for a classification task has resulted in failure for a very simple reason: we can only compute the similarity between the input text and a single label (one-to-one approach). However, with covariate encoding we can perform a many-to-many similarity comparison: we do not focus on each label individually, but we can target groups of labels.

4. Select the top k windows by similarity

By encoding our windows we can give a similarity score for each group of tags comparing it with our input vector, and sort the windows in ascending order. We can now select the top k windows. We will only take into account the labels contained in these few groups.

[
	['Based On A Novel', 'Villain Protagonist', 'Mystery Dungeon', 'RPG', 'Adventure'],
	['Interactive Fiction', 'Dungeons & Dragons', 'Dark Fantasy', 'Visual Novel', 'Mahjong'],
	['Action RPG', 'Thriller', 'Silent Protagonist', 'Action-Adventure', 'Strategy RPG'],
	['JRPG', 'Dungeon Crawler', 'Choose Your Own Adventure', 'Episodic', 'Tactical RPG'],
	['Comic Book', 'Party-Based RPG', 'Immersive', 'Character Action Game', 'Psychological Horror'],
	['Exploration', 'RPGMaker', 'Escape Room', 'Dystopian', 'Detective'],
	['Dark Comedy', 'Immersive Sim', 'Survival Horror', 'MMORPG', 'Dynamic Narration'],
	...
]

5. Recursive Label Association

The final step is using a novel algorithm called recursive label association. This technique can only work when the subsamples are subject to a covariate encoding. The previous steps were simply a filtering process to select the top k relevant tags: we have successfully reduced the number of zero-shots candidates from 446 to 100:

['Zombies',
 'Word Game',
 'Warhammer 40K',
 'Wargame',
 'Walking Simulator',
 'Visual Novel',
 'Villain Protagonist',
 'Transhumanism',
 'Trains',
 'Traditional Roguelike',
 'Time Travel',
 'Time Management',
 'Thriller',
 'Third-Person Shooter',
 'Tactical RPG',
 'Swordplay',
 ...
]

This process starts by creating an empty list. For the first iteration, we will fill the list with each tag individually and use covariate encoding to compare the score between the input vector and each group, then select the best score for each iteration.

For all subsequent iterations, we will repeat this process, testing out all the remaining tags. The list will keep growing until reaching the score threshold.

The algorithm has successfully selected the top-k labels from our input text, and it has run in just 1 second! Something we have never seen before (by simply using a raw encoder!).

['Based On A Novel',
 'Mystery Dungeon',
 'Mahjong',
 'Villain Protagonist',
 'Interactive Fiction',
 'JRPG']

If we compare the tags obtained through our novel methods with the ones sorted by kn, we realize the big performance difference (the results get more impressive when there are group of tags that are similar to each other sorted at the top, which is the biggest flaw or raw encoders: in such cases only a tag per group gets selected, as it should be).

['Based On A Novel',
 'Villain Protagonist',
 'Mystery Dungeon',
 'RPG',
 'Adventure',
 'Action RPG',
 'Thriller',
 'Silent Protagonist',
 'Action-Adventure',
 'Strategy RPG',
 'Interactive Fiction',
 'Dungeons & Dragons',
 'Dark Fantasy',
 ...]

Also note that with covariate tagging we simply need to define a score threshold to limit the amount of results. With regular knn, similarity scores are never consistent, making it impossible to choose a threshold.

Another example

If you are not impressed by the result, let us look at another example:

THE LAW!! Looks to be a showdown atop a train. This will be your last fight. Good luck, Train Bandit. WHAT IS THIS GAME?
Train Bandit is a simple score attack game. The Law will attack you from both sides. Your weapon is your keyboard.
You'll use those keys to kick the living shit out of the law. React quickly by attacking the correct direction.
React...or you're dead. THE FEATURES Unlock new bandits Earn Achievements Become Steam's Most Wanted ? Battle elite
officers Kick the law's ass

The following are the results of covariate tagging:

['Crime',
 'Tactical RPG',
 'Vehicular Combat',
 'Steam Machine',
 'Turn-Based Tactics',
 'Job Simulator',
 'Hack and Slash',
 'Battle Royale']

If we look at the sorted knn, we see once again that a proper selection has been taking place:

['Crime',
 'Combat',
 'Tactical RPG',
 'Battle Royale',
 'Fighting',
 'Combat Racing',
 'Vehicular Combat',
 'Character Action Game',
 'Strategy RPG',
 'Detective',
 'Tactical',
 'Martial Arts',
 'Swordplay',
 'Action RPG',
 ...
 ]

Evaluation

As beautiful (and fast) as it is, let us not forget that we are relying on a raw encoder. This means that some terms will still be classified as ambiguous, while the room for error is still marginally high.

Let us compare the same labeling assigned using the best open-source zero-shots models:

bart-large-mnli

The results are not actually bad, the major issue is that the entire process took 7 minutes! (which makes scalability quite challenging).

['Based On A Novel',
  'Dark',
  'Adventure',
  'Historical',
  'Gaming',
  'Psychological',
  'Immersive',
  'Thriller',
  'Stylized',
  'Psychological Horror',
  'Atmospheric',

zero-shot-implicit-bi-encoder

This zero-shot model is one of the latest open-sourced models, and the processing of all 446 tags only took 6 seconds (substantial improvement compared with traditional systems).

[(0.871, 'Feature Film'),
 (0.851, 'Thriller'),
 (0.845, 'Documentary'),
 (0.832, 'Dark Comedy'),
 (0.813, 'Action RPG'),
 (0.791, 'Cinematic'),
 (0.774, "1990's"),
 (0.731, 'Action Roguelike'),
 (0.706, 'RPG'),
 (0.696, 'On-Rails Shooter'),
 (0.695, 'Based On A Novel'),
 (0.691, 'Drama'),
 (0.629, 'Movie'),
 ...

However, we can agree that the results are suboptimal (it seems like it is overly generalizing the text).

To be really satisfied with our classification we should be using LLMs, but, as explained, those models belong to a completely different generation.

Conclusion

I might be bold enough to call this a breakthrough. This should open a new door for zero-shots applications, as it shows that there is no need for training ad hoc neural networks to perform any proper labeling.

In Conclusion, our approach:

Breaks all records for zero-shot input size
Outperforms the accuracy of zero-shots, but not LLMs
Is, at least, 100x faster than any other method

See More Posts

Cardy

Company

Twitter

team@typedream.com

Legal